Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora
نویسندگان
چکیده
The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television and radio sources from English and Mandarin Chinese. The most recent TDT corpus, TDT3, added two tasks, story link and first story detection. Annotation of the TDT corpora involved a large staff of annotators who produced millions of human judgements. As with any large corpus creation effort, quality assurance and inter-annotator consistency were a major concern. This paper reports the quality control measures adopted by the LDC during the creation of the TDT corpora, presents techniques that were utilized to evaluate and improve the consistency of human annotators for all annotation tasks, and discusses aspects of project administration that were designed to enhance annotation consistency.
منابع مشابه
Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking
Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...
متن کاملMany Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies
This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out ...
متن کامل70 24 v 1 1 3 Ju l 2 00 0 Many Uses , Many Annotations for Large Speech Corpora : Switchboard and TDT as Case Studies
This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out ...
متن کاملManagement of Large Annotation Projects Involving Multiple Human Judges: a Case Study of GALE Machine Translation Post-editing
Managing large groups of human judges to perform any annotation task is a challenge. Linguistic Data Consortium coordinated the creation of manual machine translation post-editing results for the DARPA Global Autonomous Language Exploration Program. Machine translation is one of three core technology components for GALE, which includes an annual MT evaluation administered by National Institute ...
متن کاملThe Tdt-3 Text and Speech Corpus
The TDT-3 Text and Speech Corpus expands on previous phases of Topic Detection and Tracking data collections, by increasing the number of news sources being sampled, by including Mandarin Chinese as well as English news data, and by introducing new forms of topic annotation. In order to satisfy the specific data and annotation requirements of the TDT-3 Evaluation Plan[1], the LDC refined and su...
متن کامل